Introduction

This lab introduces you to some of the basic functions in R for plotting and analyzing univariate time series data. Many of the things you learn here will be relevant when we start examining multivariate time series as well. We will begin with the creation and plotting of time series objects in R, and then moves on to decomposition, differencing, and correlation (e.g., ACF, PACF) before ending with fitting and simulation of ARMA models.


Datasets

We’ll use two publicly available environmental datasets in this lab. The main one is a time series of the atmospheric concentration of CO2 collected at the Mauna Loa Observatory in Hawai’i (ML_CO2.csv). The second is Northern Hemisphere land and ocean temperature anomalies from NOAA. (NH_temp). You can download both of them from GitHub via the following code.

## Atmospheric CO2 measured on Mauna Loa, Hawai'i
CO2 <- read.csv("https://raw.githubusercontent.com/SOE592/website/main/lectures/day_01/data/ML_CO2.csv")

## Northern hemisphere temperature anomolies
NH_temp <- read.csv("https://raw.githubusercontent.com/SOE592/website/main/lectures/day_01/data/NH_temp.csv")

Time series objects

The CO2 data are stored in R as a data.frame object, but we would like to transform the class to a more user-friendly format for dealing with time series. Fortunately, the ts() function will do just that, and return an object of class ts as well.

Tip: Type ?ts at the command prompt to see all of function arguments.

In addition to the data themselves, we need to provide ts() with two pieces of information about the time index for the data:

  1. frequency is a bit of a misnomer because it does not really refer to the number of cycles per unit time, but rather the number of observations/samples per cycle. So, for example, if the data were collected every hour of a day then frequency = 24.

  2. start specifies the starting time or date for the first data point in terms of c(unit, subunit).

So, for example, if the data were collected monthly beginning in November of 1999, then frequency = 12 and start = c(1999, 11). If the data were collected annually, then you simply specify start as a scalar (e.g., start = 1991) and omit frequency (i.e., R will set frequency = 1 by default).

The Mauna Loa time series is collected monthly and begins in March of 1958, which we can get from the data themselves, and then pass to ts().

## create a time series (ts) object from the CO2 data
co2 <- ts(data = CO2$ppm, frequency = 12,
          start = c(CO2[1, "year"], CO2[1, "month"]))

Combining multiple ts objects

Before we examine the CO2 data further, let’s see a quick example of how you can combine multiple time series together. We’ll use the data on monthly mean temperature anomolies for the Northern Hemisphere (NH_temp).

Task: Begin by converting NH_temp to a ts object.

## convert temperature data to ts object
temp_ts <- ts(data = NH_temp$Value, frequency = 12,
              start = c(1880, 1))

We need a way to line up the time indices because the temperature data start in January of 1880, but the CO2 data start in March of 1958. Fortunately, the ts.intersect() function makes this really easy once the data have been transformed to ts objects by trimming the data to a common time frame. Also, ts.union() works in a similar fashion, but it pads one or both series with the appropriate number of NA.

Task: Compare the results of ts.intersect() and ts.union().

## intersection (only overlapping times)
dat_int <- ts.intersect(co2, temp_ts)

## dimensions of common-time data
dim(dat_int)
## [1] 682   2
## union (all times)
dat_unn <- ts.union(co2, temp_ts)

## dimensions of all-time data
dim(dat_unn)
## [1] 1647    2

As you can see, the intersection of the two data sets is much smaller than the union. If you compare them, you will see that the first 965 rows of dat_unn contains NA in the co2 column.


Time series plots

Time series plots are an excellent way to begin the process of understanding what sort of process might have generated the data of interest. Traditionally, time series have been plotted with the observed data on the \(y\)-axis and time on the \(x\)-axis. Sequential time points are usually connected with some form of line, but sometimes other plot forms can be a useful way of conveying important information in the time series (e.g., barplots of sea-surface temperature anomolies show nicely the contrasting El Niño and La Niña phenomena).

We can use the base function plot.ts() to plot a time series, which is designed specifically for ts objects like the one we just created above. It’s nice because we don’t need to specify any x-values as they are taken directly from the ts object. The actual syntax is a bit odd because you specify x = values when actually x is the time index (i.e., length of the time series) and y are the values to plot.

## plot the ts
plot.ts(co2)
Time series of the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i measured monthly from March 1958 to present.

Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i measured monthly from March 1958 to present.


Tip: You can use the expression() function to create nicer plot labels.

## plot the ts
plot.ts(co2, ylab = expression(paste("CO"[2], " (ppm)")))
Time series of the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i measured monthly from March 1958 to present.

Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i measured monthly from March 1958 to present.

Aside

Examination of the plotted time series shows two obvious features that would violate any assumption of stationarity:

  1. an increasing (and perhaps non-linear) trend over time, and

  2. strong seasonal patterns.

Do you know the causes of these phenomena?

Plotting multiple ts objects

You can plot multiple time series with plot.ts() by passing a ts object with multiple time series joined via ts.intersect() or ts.union(). You can also use cbind() to join them.

Task: Plot the intersection of the CO2 and temperature data.

## plot the ts
plot(dat_int)
Time series of the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i (top) and the mean temperature index for the Northern Hemisphere (bottom) measured monthly from March 1958 to present.

Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) and the mean temperature index for the Northern Hemisphere (bottom) measured monthly from March 1958 to present.


Tip: The regular plot() function in R is smart enough to recognize a ts object and use the information contained therein appropriately.

Task: Plot the intersection of the two time series together with no title and the y-axes on alternate sides.

## plot the ts
plot(dat_int, main = "", yax.flip = TRUE)
Time series of the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i (top) and the mean temperature index for the Northern Hemisphere (bottom) measured monthly from March 1958 to present.

Time series of the atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) and the mean temperature index for the Northern Hemisphere (bottom) measured monthly from March 1958 to present.


Decomposition of time series

We typically write a decomposition model as

\[ x_t = m_t + s_t + e_t, \]

where, at time \(t\), \(m_t\) is the trend, \(s_t\) is the seasonal effect, and \(e_t\) is a random error that we generally assume to have zero-mean and to be correlated over time. Thus, by estimating and subtracting both \(\{m_t\}\) and \(\{s_t\}\) from \(\{x_t\}\), we hope to have a time series of stationary residuals \(\{e_t\}\).

Estimating seasonal effects

Once we have an estimate of the trend (\(\hat{m}_t\)) we can easily obtain an estimate of the seasonal effect (\(\hat{s}_t\)) by subtraction

\[ \hat{s}_t = x_t - \hat{m}_t, \]

Task: Calculate the time series of seasonal effects.

## seasonal effect over time
co2_seas <- co2 - co2_trend

Tip: You can see how the estimate of the seasonal effect contains the random error by plotting the time series.

## plot the monthly seasonal effects
plot.ts(co2_seas, ylab = "Seasonal effect plus error", cex = 1)
Time series of seasonal effects plus random errors for the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i, measured monthly from March 1958 to present.

Time series of seasonal effects plus random errors for the atmospheric CO2 concentration at Mauna Loa, Hawai’i, measured monthly from March 1958 to present.

Mean seasonal effects

We can obtain the mean seasonal effects by averaging the estimates of \(\{\hat{s}_t\}\) for each month and repeating this sequence over all years.

## length of ts
ll <- length(co2_seas)

## frequency (ie, 12)
ff <- frequency(co2_seas)

## number of periods (years); %/% is integer division
periods <- ll %/% ff

## index of cumulative month
index <- seq(1, ll, by = ff) - 1

## get mean by month
mm <- numeric(ff)
for (i in 1:ff) {
  mm[i] <- mean(co2_seas[index + i], na.rm = TRUE)
}

## subtract mean to make overall mean = 0
mm <- mm - mean(mm)

Task: Plot the average monthly effects to see what is happening within a year.

## plot the monthly seasonal effects
plot.ts(mm, ylab = "Seasonal effect", xlab = "Month", cex = 1)

As expected, the CO2 concentration is highest in spring (March) and lowest in summer (August).

Estimated monthly seasonal effects for the atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i.

Estimated monthly seasonal effects for the atmospheric CO2 concentration at Mauna Loa, Hawai’i.

Task: Create the entire time series of seasonal effects \(\{\hat{s}_t\}\).

## replicate the monthly means over all years
co2_seas_avg <- rep(mm, periods + 1)[seq(ll)]

## create ts object for season
co2_seas_ts <- ts(co2_seas_avg,
                  start = start(co2_seas),
                  frequency = ff)

Completing the model

The last step in completing our full decomposition model is obtaining the random errors \(\hat{e}_t\), which we can get via simple subtraction

\[ \hat{e}_t = x_t - \hat{m}_t - \hat{s}_t. \]

Task: Calculate the time series of error terms.

## random errors over time
co2_err <- co2 - co2_trend - co2_seas_ts

Task: Plot all 3 of the estimated model components along with the observed data \(\{x_t\}\).

## plot the obs ts, trend & seasonal effect
plot(cbind(co2, co2_trend, co2_seas_ts, co2_err), main = "", yax.flip = TRUE)
Time series of the observed atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i (top) along with the estimated trend, seasonal effects, and random errors.

Time series of the observed atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) along with the estimated trend, seasonal effects, and random errors.


Using decompose()

Now that we have seen how to estimate and plot the various components of a classical decomposition model in a piece-wise manner, let’s see how to do it all in one step using the function decompose(). decompose() accepts a ts object as input and returns an object of class decomposed.ts.

Task: Use decompose() to decompose the CO2 data.

## decomposition of CO2 data
co2_decomp <- decompose(co2)

co2_decomp is a list with the following familiar elements:

  • x: the observed time series \(\{x_t\}\)
  • seasonal: time series of estimated seasonal component \(\{\hat{s}_t\}\)
  • figure: mean seasonal effect (length(figure) == frequency(x))
  • trend: time series of estimated trend \(\{\hat{m}_t\}\)
  • random: time series of random errors \(\{\hat{e}_t\}\)
  • type: type of error ("additive" or "multiplicative")

Task: Plot the estimated components and compare them to the above results obtained the long way.

## plot the obs ts, trend & seasonal effect
plot(co2_decomp, yax.flip = TRUE)
Time series of the observed atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i (top) along with the estimated trend, seasonal effects, and random errors obtained with the function `decompose()`.

Time series of the observed atmospheric CO2 concentration at Mauna Loa, Hawai’i (top) along with the estimated trend, seasonal effects, and random errors obtained with the function decompose().

Success: The results obtained with decompose() are identical to those we estimated previously!

Tip: Another nice feature of the decompose() function is that it can be used for decomposition models with multiplicative (i.e., non-additive) errors (e.g., if the original time series had a seasonal amplitude that increased with time). To do, so pass in the argument type = "multiplicative", which is set to type = "additive" by default.


Differencing a time series

An alternative to decomposition for removing trends and seasonal patterns is so-called differencing. We saw in lecture how the difference operator works and how it can be used to remove linear and nonlinear trends as well as various seasonal features that might be evident in the data. As a reminder, we define the difference operator as

\[ \nabla x_t = x_t - x_{t-1}, \]

and, more generally, for order \(d\)

\[ \nabla^d x_t = (1-\mathbf{B})^d x_t, \] where \(\mathbf{B}\) is the backshift operator.

So, for example, a random walk is one of the most simple and widely used time series models. We can write a random walk model as

\[ x_t = x_{t-1} + w_t \\ w_t \sim \text{N}(0,q) \]

Applying the difference operator to this random walk will yield a time series of Gaussian white noise \(\{w_t\}\):

\[ \begin{aligned} \nabla (x_t &= x_{t-1} + w_t) \\ x_t - x_{t-1} &= x_{t-1} - x_{t-1} + w_t \\ x_t - x_{t-1} &= w_t \end{aligned} \]

Using the diff() function

In R we can use the diff() function for differencing a time series, which requires 3 arguments:

  1. x:the data

  2. lag: the lag at which to difference

  3. differences: \(d\) in \(\nabla^d x_t\).

Tips: Some thing to note when differencing:

  1. First differencing a time series will remove a linear trend (set differences = 1).

  2. Twice differencing a time series will remove a quadratic trend (set differences = 2).

  3. First differencing a time series at a lag equal to frequency will remove a seasonal trend (set lag = 12 for monthly data).

Task: Use diff() to remove the trend and seasonal signal from the CO2 time series by setting differences = 2 and plot the result.

## twice-difference the CO2 data
co2_d2 <- diff(co2, differences = 2)

## plot the differenced data
plot(co2_d2, ylab = expression(paste(nabla^2, "CO"[2])))
Time series of the twice-differenced atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i.

Time series of the twice-differenced atmospheric CO2 concentration at Mauna Loa, Hawai’i.


Note: We were apparently successful in removing the trend, but the seasonal effect is still there.

Task: Go ahead and difference the new series at lag = 12 because our data were collected monthly.

## difference the differenced CO2 data
co2_d2d12 <- diff(co2_d2, lag = 12)

## plot the newly differenced data
plot(co2_d2d12,
     ylab = expression(paste(nabla, "(", nabla^2, "CO"[2], ")")))
Time series of the lag-12 difference of the twice-differenced atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i.

Time series of the lag-12 difference of the twice-differenced atmospheric CO2 concentration at Mauna Loa, Hawai’i.


Success: Now we have a time series that appears to be random errors without any obvious trend or seasonal components!


Autocorrelation function

Recall that the autocorrelation function (ACF), which we denote as \(r_k\), estimates the correlation of a variable with itself at differing time lags

\[ r_k = \text{Cor}(x_t,x_{t+k}). \]

Recall also that a 95% confidence interval on the ACF can be estimated by

\[ \pm \frac{1.96}{\sqrt{n}} \]

where \(n\) is the number of data points used in the calculation of the ACF.

It is important to remember two things here:

  1. Although the confidence interval is commonly plotted and interpreted as a horizontal line over all time lags, the interval itself actually grows as the lag increases because the number of data points \(n\) used to estimate the correlation decreases by 1 for every integer increase in lag.

  2. Care must be exercised when interpreting the “significance” of the correlation at various lags because we should expect, a priori, that approximately 1 out of every 20 correlations will be significant based on chance alone.

Tip: We can use the acf() function in R to compute the sample ACF. Calling the function by itself will will automatically produce a correlogram. The argument lag.max allows you to set the number of positive and negative lags (Type ?acf for details).

Task: Estimate the ACF for the CO2 data out to a lag of 36 months.

## correlogram of the CO2 data
acf(co2, lag.max = 36)
Correlogram of the observed atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i obtained with the function `acf()`.

Correlogram of the observed atmospheric CO2 concentration at Mauna Loa, Hawai’i obtained with the function acf().


Note: There are 2 things about this correlogram worth noting:

  1. The \(x\)-axis has decimal values for lags, which is caused by R using the year index as the lag rather than the month.

  2. There is very high autocorrelation even out to lags of 36 months.

Deterministic time series

Let’s look at the ACF for some deterministic time series, which will help you identify interesting properties (e.g., trends, seasonal effects) in a stochastic time series, and account for them in time series models–an important topic in this course.

Task: Examine the ACF for a straight line.

## length of ts
nn <- 100

## create straight line
tt <- seq(nn)

## set up plot area
par(mfrow = c(1, 2))

## plot the line
plot.ts(tt, ylab = expression(italic(x[t])))

## plot its ACF
acf(tt)
Time series plot of a straight line (left) and the correlogram of its ACF (right).

Time series plot of a straight line (left) and the correlogram of its ACF (right).

Tip: The correlogram for a straight line is itself a linearly decreasing function over time.

Task: Examine the ACF for a sine wave and see what sort of pattern arises.

## create sine wave
tt <- sin(2 * pi * seq(nn) / 12)

## set up plot area
par(mfrow = c(1, 2))

## plot line
plot.ts(tt, ylab = expression(italic(x[t])))

## get ACF
sine_acf <- acf(tt)
Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).

Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).

Tip: The correlogram for a sine wave is itself a sine wave whose amplitude decreases linearly over time.

Task: Examine the ACF for a sine wave with a linear downward trend and see what sort of pattern arises.

## create sine wave with trend
tt <- sin(2 * pi * seq(nn) / 12) - seq(nn) / 50

## set up plot area
par(mfrow = c(1, 2))

## plot line
plot.ts(tt, ylab = expression(italic(x[t])))

## get ACF
sili_acf <- acf(tt, plot = FALSE)

## plot ACF
plot_acf(sili_acf)
Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).

Time series plot of a discrete sine wave (left) and the correlogram of its ACF (right).

Tip: The correlogram for a sine wave with a trend is itself a nonsymmetrical sine wave whose amplitude and center decrease over time.


Partial autocorrelation function

The partial autocorrelation function (PACF) measures the linear correlation of a series \(\{x_t\}\) and a lagged version of itself \(\{x_{t+k}\}\) with the linear dependence of \(\{x_{t-1},x_{t-2},\dots,x_{t-(k-1)}\}\) removed.

Tip: It’s easy to compute the PACF in R using the pacf() function, which will automatically plot a correlogram when called by itself.

Task: Look at the PACF for the CO2 data.

## PACF of the CO2 data
pacf(co2, lag.max = 36)
Correlogram of the PACF for the observed atmospheric CO<sub>2</sub> concentration at Mauna Loa, Hawai'i obtained with the function `pacf()`.

Correlogram of the PACF for the observed atmospheric CO2 concentration at Mauna Loa, Hawai’i obtained with the function pacf().

Note the following about this PACF:

  1. The partial autocorrelation at lag-1 is very high (it equals the ACF at lag-1), but the other values at lags > 1 are relatively small, unlike what we saw for the ACF.

  2. The PACF plot again has real-valued indices for the time lag, but it does not include any value for lag-0 because it is impossible to remove any intermediate autocorrelation between \(t\) and \(t-k\) when \(k=0\), and therefore the PACF does not exist at lag-0.

Tip: As with the ACF, we will see later on how the PACF can also be used to help identify the appropriate order of \(p\) and \(q\) in ARMA(\(p\),\(q\)) models.


Cross-correlation function

Often we are interested in looking for relationships between 2 different time series. There are many ways to do this, but a simple method is via examination of their cross-covariance and cross-correlation.

The sample cross-correlation function (CCF) is defined analogously to the ACF, such that

\[ r_k^{xy} = \text{Cor}(y_t,x_{t+k}). \]

Note that \(r_k^{xy} \neq r_{-k}^{xy}\), but \(r_k^{xy} = r_{-k}^{yx}\). Therefore, it is very important to pay particular attention to which variable you call \(y\) (“response”) and which you call \(x\) (“predictor”).

As with the ACF, an approximate 95% confidence interval on the CCF can be estimated by

\[ \pm \frac{1.96}{\sqrt{n}} \]

where \(n\) is the number of data points used in the calculation of the CCF, and the same assumptions apply to its interpretation.

Tip: Computing the CCF is easy with the ccf() function, which works just like acf(). In fact, ccf() is just a “wrapper” function that calls acf().

As an example, let’s examine the CCF between sunspot activity and number of lynx trapped in Canada as in the classic paper by Moran (Moran, P.A.P. 1949. The statistical analysis of the sunspot and lynx cycles. J. Anim. Ecol. 18:115-116).

To begin, let’s get the data, which are conveniently included in the {datasets} package included as part of the base installation of R. Before calculating the CCF, however, we need to find the matching years of data.

Task: Use the ts.intersect() function to join the two datasets.

## join the two data sets
data_both <- ts.intersect(sunspot.year, lynx)

## rename `sunspot.year` to `sunspots`
dimnames(data_both)[[2]][1] <- "sunspots"

Task: Plot both time series.

## plot the time series
plot.ts(data_both, yax.flip = TRUE, main = "")
Time series of sunspot activity (top) and lynx trappings in Canada (bottom) from 1821-1934.

Time series of sunspot activity (top) and lynx trappings in Canada (bottom) from 1821-1934.


Tip: In this case, it seems most relevant to treat lynx as the \(y\) and sunspots as the \(x\) in ccf(x, y, ...), in which case we are interested in the CCF at negative lags (i.e., when sunspot activity predates lynx abundance).

Task: Compute the CCF between sunspot activity and the log of lynx abundance.

## get sunspot data
sunspots <- data_both[, "sunspots"]

## get  lynx data
lynx <- data_both[, "lynx"]

## CCF of sunspots and log(lynx)
ccf(sunspots, log(lynx), ylab = "Cross-correlation")
CCF for annual sunspot activity and the log of the number of lynx trappings in Canada from 1821-1934.

CCF for annual sunspot activity and the log of the number of lynx trappings in Canada from 1821-1934.


It looks like lynx numbers are relatively low 3-5 years after high sunspot activity (i.e., significant correlation at lags of -3 to -5).